
    A reference haplotype panel for genome-wide imputation of short tandem repeats.

    Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) data from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve a mean concordance of 97% with observed genotypes in an external dataset, compared to 71% expected under a naive model. Performance varies widely across STRs, with near-perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits.
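To make the concordance figures above concrete, here is a minimal sketch of how genotype concordance at one STR could be computed. The abstract does not give the exact definition used, so this assumes the simplest one: the fraction of samples whose imputed genotype exactly matches the observed genotype, ignoring allele order and skipping missing calls. The function name `genotype_concordance` and the representation of a genotype as a pair of repeat counts are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Sequence, Tuple

# A diploid STR genotype as a pair of allele lengths (repeat counts).
# None marks a missing call. This representation is an assumption.
Genotype = Optional[Tuple[int, int]]


def genotype_concordance(imputed: Sequence[Genotype],
                         observed: Sequence[Genotype]) -> float:
    """Fraction of samples whose imputed genotype equals the observed one.

    Allele order is ignored, so (12, 14) matches (14, 12).
    Samples with a missing call on either side are skipped.
    """
    if len(imputed) != len(observed):
        raise ValueError("imputed and observed must have the same length")

    compared = 0
    matches = 0
    for imp, obs in zip(imputed, observed):
        if imp is None or obs is None:
            continue  # cannot score a sample without both calls
        compared += 1
        if sorted(imp) == sorted(obs):
            matches += 1

    if compared == 0:
        raise ValueError("no samples with both imputed and observed calls")
    return matches / compared


# Toy data: two of the three scorable samples agree.
imputed_calls = [(12, 14), (10, 10), (11, 13), (9, 12)]
observed_calls = [(14, 12), (10, 11), (11, 13), None]
toy_concordance = genotype_concordance(imputed_calls, observed_calls)
```

Under this definition a bi-allelic STR is easier to score well on than a highly polymorphic one, since there are far fewer possible genotypes to confuse. An allele-level variant (crediting half a match when one of the two alleles agrees) would be an equally plausible reading of "concordance".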

    Deep Characterization of the Contribution of Short Tandem Repeats Across Tissues

    High-throughput sequencing (HTS) and genome-wide association studies (GWAS) have given us unprecedented insights into the influence of single nucleotide variants (SNVs) and copy number variants (CNVs) on different phenotypes, including gene expression, diseases, and complex traits. However, how other complex genetic variants, such as short tandem repeats (STRs), affect gene expression remains largely unknown. Identifying and genotyping these variants from short DNA sequencing reads or low-coverage data present difficult bioinformatics challenges. Additionally, traditional association tests must be modified to handle highly multi-allelic loci such as STRs. Several studies have examined the effect of STRs on gene expression genome-wide. However, these studies were restricted to a single cell type, such as whole blood or lymphoblastoid cell lines (LCLs), and had limited power to detect associations due to low-quality genotypes. As a result, they have yielded limited biological insight and are difficult to interpret in other contexts. In this dissertation, we address the importance of incorporating STRs in causal screening and large-scale medical genetics studies. We perform the first, and to date largest, characterization of STRs that contribute to gene expression variation across multiple tissues. To ensure robust and reliable results, we leverage data from the GTEx project, which has collected high-coverage whole-genome sequencing and RNA-sequencing data across dozens of tissues from more than 600 individuals. Our work confirms a clear contribution of STRs to gene expression regulation, with 25,554 eSTRs identified across 17 tissues. Of these, 14% are identified as high-confidence causal variants after fine-mapping against nearby SNPs. eSTRs are highly enriched at predicted promoter and enhancer regions and for motifs with high GC content.
    We identified a subset of eSTRs capable of forming G-quadruplexes (G4), a highly stable DNA secondary structure known to be involved in gene regulation. We show that long G4-forming STRs tend to increase expression of nearby genes, potentially by lowering the free energy of promoter regions and promoting RNA polymerase II stalling. Finally, we identify high-confidence eSTRs that likely underlie previously identified genetic associations with complex phenotypes, including schizophrenia and blood-related traits.